18 research outputs found

    Lyndon Array Construction during Burrows-Wheeler Inversion

    Get PDF
    In this paper we present an algorithm to compute the Lyndon array of a string TT of length nn as a byproduct of the inversion of the Burrows-Wheeler transform of TT. Our algorithm runs in linear time using only a stack in addition to the data structures used for Burrows-Wheeler inversion. We compare our algorithm with two other linear-time algorithms for Lyndon array construction and show that computing the Burrows-Wheeler transform and then constructing the Lyndon array is competitive compared to the known approaches. We also propose a new balanced parenthesis representation for the Lyndon array that uses 2n+o(n)2n+o(n) bits of space and supports constant time access. This representation can be built in linear time using O(n)O(n) words of space, or in O(nlogn/loglogn)O(n\log n/\log\log n) time using asymptotically the same space as TT

    Space efficient merging of de Bruijn graphs and Wheeler graphs

    Full text link
    The merging of succinct data structures is a well established technique for the space efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de Bruijn graphs. Our algorithm has the same asymptotic cost of the state of the art algorithm for the same problem but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds. In the second part of the paper we consider the more general problem of merging succinct representations of Wheeler graphs, a recently introduced graph family which includes as special cases de Bruijn graphs and many other known succinct indexes based on the BWT or one of its variants. We show that Wheeler graphs merging is in general a much more difficult problem, and we provide a space efficient algorithm for the slightly simplified problem of determining whether the union graph has an ordering that satisfies the Wheeler conditions.Comment: 24 pages, 10 figures. arXiv admin note: text overlap with arXiv:1902.0288

    Lossy Compressor preserving variant calling through Extended BWT

    Full text link
    A standard format used for storing the output of high-throughput sequencing experiments is the FASTQ format. It comprises three main components: (i) headers, (ii) bases (nucleotide sequences), and (iii) quality scores. FASTQ files are widely used for variant calling, where sequencing data are mapped into a reference genome to discover variants that may be used for further analysis. There are many specialized compressors that exploit redundancy in FASTQ data with the focus only on either the bases or the quality scores components. In this paper we consider the novel problem of lossy compressing, in a reference-free way, FASTQ data by modifying both components at the same time, while preserving the important information of the original FASTQ. We introduce a general strategy, based on the Extended Burrows-Wheeler Transform (EBWT) and positional clustering, and we present implementations in both internal memory and external memory. Experimental results show that the lossy compression performed by our tool is able to achieve good compression while preserving information relating to variant calling more than the competitors. Availability: the software is freely available at https://github.com/veronicaguerrini/BFQzip.Comment: Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologie

    A Grammar Compression Algorithm based on Induced Suffix Sorting

    Full text link
    We introduce GCIS, a grammar compression algorithm based on the induced suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution builds on the factorization performed by SAIS during suffix sorting. We construct a context-free grammar on the input string which can be further reduced into a shorter string by substituting each substring by its correspondent factor. The resulting grammar is encoded by exploring some redundancies, such as common prefixes between suffix rules, which are sorted according to SAIS framework. When compared to well-known compression tools such as Re-Pair and 7-zip, our algorithm is competitive and very effective at handling repetitive string regarding compression ratio, compression and decompression running time

    External memory BWT and LCP computation for sequence collections with applications

    Get PDF
    We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external memory and in the process it also computes the LCP values. We show that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and when the average LCP of the collection is relatively small compared to the length of the sequences. In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP arrays, provide simple, scan based, external memory algorithms for three well known problems in bioinformatics: the computation of the all pairs suffix-prefix overlaps, the computation of maximal repeats, and the construction of succinct de Bruijn graphs

    Analysis of structural brain asymmetries in attention-deficit/hyperactivity disorder in 39 datasets

    Get PDF
    Objective Some studies have suggested alterations of structural brain asymmetry in attention-deficit/hyperactivity disorder (ADHD), but findings have been contradictory and based on small samples. Here, we performed the largest ever analysis of brain left-right asymmetry in ADHD, using 39 datasets of the ENIGMA consortium. Methods We analyzed asymmetry of subcortical and cerebral cortical structures in up to 1,933 people with ADHD and 1,829 unaffected controls. Asymmetry Indexes (AIs) were calculated per participant for each bilaterally paired measure, and linear mixed effects modeling was applied separately in children, adolescents, adults, and the total sample, to test exhaustively for potential associations of ADHD with structural brain asymmetries. Results There was no evidence for altered caudate nucleus asymmetry in ADHD, in contrast to prior literature. In children, there was less rightward asymmetry of the total hemispheric surface area compared to controls (t = 2.1, p = .04). Lower rightward asymmetry of medial orbitofrontal cortex surface area in ADHD (t = 2.7, p = .01) was similar to a recent finding for autism spectrum disorder. There were also some differences in cortical thickness asymmetry across age groups. In adults with ADHD, globus pallidus asymmetry was altered compared to those without ADHD. However, all effects were small (Cohen’s d from −0.18 to 0.18) and would not survive study-wide correction for multiple testing. Conclusion Prior studies of altered structural brain asymmetry in ADHD were likely underpowered to detect the small effects reported here. Altered structural asymmetry is unlikely to provide a useful biomarker for ADHD, but may provide neurobiological insights into the trait

    Optimal Suffix Sorting And Lcp Array Construction For Constant Alphabets

    No full text
    Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)We show how the longest common prefix (LCP) array can be generated as a by-product of the suffix array construction algorithm SACA-K (Nong, 2013). Our algorithm builds on Fischer's proposal (Fischer, WADS'11), and also runs in linear time, but uses only constant extra memory for constant alphabets. (C) 2016 Elsevier B.V. All rights reserved.1183034CAPESCNPq [162338/2015-5]Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq

    Space Efficient Merging of de Bruijn Graphs and Wheeler Graphs

    No full text
    The merging of succinct data structures is a well established technique for the space efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de Bruijn graphs. Our algorithm has the same asymptotic cost of the state of the art algorithm for the same problem but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds. In the second part of the paper we consider the more general problem of merging succinct representations of Wheeler graphs, a recently introduced graph family which includes as special cases de Bruijn graphs and many other known succinct indexes based on the BWT or one of its variants. In this paper we provide a space efficient algorithm for Wheeler graph merging; our algorithm works under the assumption that the union of the input Wheeler graphs has an ordering that satisfies the Wheeler conditions and which is compatible with the ordering of the original graphs

    Algorithms to compute the Burrows-Wheeler similarity distribution

    No full text
    The Burrows-Wheeler transform (BWT) is a well studied text transformation widely used in data compression and text indexing. The BWT of two strings can also provide similarity measures between them, based on the observation that the more their symbols are intermixed in the transformation, the more the strings are similar. In this article we present two new algorithms to compute similarity measures based on the BWT for string collections. In particular, we present practical and theoretical improvements to the computation of the Burrows-Wheeler Similarity Distribution for all pairs of strings in a collection. Our algorithms take advantage of the BWT computed for the concatenation of all strings, and use compressed data structures that allow reducing the running time with a small memory footprint, as shown by a set of experiments with real and artificial datasets782145156CONSELHO NACIONAL DE DESENVOLVIMENTO CIENTÍFICO E TECNOLÓGICO - CNPQFUNDAÇÃO DE AMPARO À PESQUISA DO ESTADO DE SÃO PAULO - FAPESP303012/2015-3; 425340/2016-3; 310685/2015-02017/09105-0; 2018/21509-2; 2015/50122-
    corecore